Abstract

The Reconfigurable Arithmetic Processor (RAP) is an arithmetic processing node
for a message-passing, MIMD concurrent computer. It incorporates on one chip
several serial, 64 bit floating point arithmetic units connected by a switching
network. By sequencing the switch through different patterns, the RAP chip
calculates complete arithmetic formulas. By chaining together its arithmetic
units the RAP reduces the amount of off chip data transfer; in the examples we
have simulated, off chip I/O can often be reduced to 30% or 40% of that required
by a conventional arithmetic chip. Simulations predict a peak performance of
20MFlops with 800Mbit/sec off chip bandwidth in a 2µm CMOS process.


*The research described in this paper was supported in part by the Defense
Advanced Research Projects Agency under contracts [illegible].

1 Introduction

1.1 Summary

The problem in building fast arithmetic chips does not have to do as much with
building fast arithmetic circuits as with being able to supply the necessary
I/O bandwidth. For example, a conventional 64 bit-parallel floating point adder
or multiplier pipe running at 20MFlops requires an I/O bandwidth of 3.8Gbit/sec
to be kept busy. This level of I/O is very difficult to achieve with anything
less than dedicated lines and a continuous stream of data.

The Reconfigurable Arithmetic Processor (RAP) is a CMOS, 64 bit, floating point
arithmetic chip designed to sustain high rates of floating point operations,
while requiring only a fraction of the I/O bandwidth of a conventional
arithmetic chip. To do this the RAP allows the direct calculation of
expressions that contain several adds, subtracts, and multiplies. In effect the
chip can be thought of as calculating a complete formula rather than a single
primitive operation.

The RAP uses serial arithmetic. Serial implementations of arithmetic are more
area efficient than parallel implementations and allow us to put several
Arithmetic Units (AUs) on a single chip. Having narrow serial datapaths also
allows us to implement an area efficient switching network that can be used to
route data between AUs. Although a single serial unit is slower than a parallel
implementation, the RAP makes up for this by exploiting the functional
parallelism achieved by having several units on one chip: instead of a single
20MFlop unit, we have eight 2.5MFlop units running in parallel. In the examples
we have simulated, AU utilization ranged from 30% to 60% depending on the
problem.

The RAP datapath shown in Figure 1 consists of a number of four-bit serial AUs,
a switch, input registers, and output registers. Data first enters the switch
and gets routed to the appropriate AUs. Intermediate results are fed back into
the switch, which is reconfigured to allow the next stage of the computation to
take place. When the computation is complete the results are sent to the output
registers.

Figure 1: RAP Datapath

At a higher level, the RAP has a message passing interface. A RAP is sent
messages that define equations as a sequence of switch configurations, which
are stored in local memory. Subsequent messages use these stored configurations
to evaluate the equation. Mechanisms are included that allow the pipelining of
several RAPs so that the output of one RAP can be used as the input to another.
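
The bandwidth and throughput figures quoted in this summary follow from simple
arithmetic. The Python sketch below is only a worked check. It assumes, as
Section 2 does for a conventional chip, that every operation moves two 64 bit
input words and one 64 bit result across the chip boundary. The function names
are illustrative and are not part of the RAP design.

```python
WORD_BITS = 64      # operand width stated in the text
WORDS_PER_OP = 3    # assumption: two input words and one result word per operation


def conventional_io_bits_per_sec(flops: float) -> float:
    """I/O rate needed when every operand and result crosses the chip boundary."""
    return flops * WORDS_PER_OP * WORD_BITS


def aggregate_flops(units: int, flops_per_unit: float) -> float:
    """Peak rate of several identical units running in parallel."""
    return units * flops_per_unit


if __name__ == "__main__":
    # A 20MFlop bit-parallel pipe: 20e6 * 3 * 64 bits per second.
    print(conventional_io_bits_per_sec(20e6) / 1e9, "Gbit/sec")
    # Eight serial units at 2.5MFlops each.
    print(aggregate_flops(8, 2.5e6) / 1e6, "MFlops")
```

The first figure comes out at 3.84 Gbit/sec, which the text rounds to
3.8Gbit/sec, and the second reproduces the 20MFlop peak.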

1.2 Background


Numerically intensive computer applications such as analog circuit simulation,
the simulation of physical phenomena such as N-body problems, finite element
analysis, digital signal processing, and three dimensional graphics require
large amounts of floating point computing power [15]. To satisfy this demand,
many special purpose board level and chip level arithmetic processors have been
built [7] [11] [17]. Approaches range from math coprocessors that act as simple
extensions to a main processor (e.g. the Intel 80387 and the Motorola MC68881)
to dedicated math processors designed for specific applications. In most cases
these processors are implemented using a bit-parallel approach. Because of this
approach, implementations are expensive in terms of silicon area and only one
or two floating point units can be put on a single chip.

The area efficiency of serial arithmetic allows several floating point units to
be put on a single chip. Serial arithmetic has been used in many Digital Signal
Processing applications [6] [8]. The idea of exploiting functional parallelism
using serial fixed point arithmetic has been used in this area [9]. Many
algorithm alternatives exist for serial arithmetic implementation [1] [10] [13]
[16].

Another approach to the I/O bandwidth problem is to use a register file to
store operands and intermediate results [17]. The register file serves the same
function as a switch, selecting data to be input to each function unit during
each pipeline time slot. The register file performs this switching both by
storing data to move it to a different time slot, and by multiplexing many
registers into each register file port. The serial switch in the RAP eliminates
the need for storage and simplifies the multiplexing. The resulting switch is
smaller, both because it is serial and because it contains no storage. The
switch is also simpler to control: switch configurations are changed each word
time, while register file addresses must be changed each clock cycle. The slow
control signals allow us to operate the switch faster than a comparable
register file. Throughout the rest of this paper, the bit-parallel AU with no
means of exploiting locality is used as a baseline level of comparison.


The RAP chip is being designed as a part of the J-Machine [3], a message
passing concurrent computer system under development at MIT. This system is
based on a mesh routing network that connects a collection of processing nodes,
and uses wormhole routing techniques to reduce message latency to 2µs for a
[illegible] bit message on a 4K node network [8]. Each single-chip node
includes both the network communication hardware and a processing element. The
RAP chip is one node type that can fit into the network "slots". It includes
the necessary control circuitry and message handling capabilities to fit into
the system. The RAP borrows several ideas that were first developed in the
Message Driven Processor (MDP) [2], which is the general purpose computing node
for the system. In particular, the RAP makes use of the same network
communication scheme [4] [5].


1.3 Outline

The remainder of this paper describes the RAP in detail. Section 2 gives an
example of how the RAP is used in a typical application. Section 3 describes
the messages that are used to control the RAP, while section 4 describes the
architecture. Performance results derived from simulations are presented in
section 5. Section 6 briefly discusses a simplified fixed-point RAP chip that
has been fabricated and tested. Finally, section 7 offers some concluding
remarks.


2 An Example

In order to illustrate how the RAP uses functional parallelism to exploit the
locality and concurrency found in mathematical equations, we consider the
calculation of a 4-point Fast Fourier Transform (FFT) [12] [14]. The 4-point
FFT dataflow graph is shown in Figure 2 and involves 12 multiplies and 22
additions used to calculate the real and imaginary parts of the 4 output
results. This graph would be evaluated by a RAP as follows: First a "method"
would be stored in the RAP memory describing each level of the calculation.
Then a message would be sent containing the 14 input variables necessary for
the computation. Assuming an ideal setting, the RAP would successively run
through each level of the calculation as described by the method, exploiting
functional parallelism by doing all operations of a given level in parallel.
Finally it would send a message containing the results to the appropriate
destination.

Figure 2: 4-Point FFT Dataflow Graph

In a realistic setting, determining the successive configurations of a method
involves a scheduling problem, since the RAP may not have enough AUs to perform
all possible concurrent operations at once. The RAP we are building has 4
adder/subtractors and 4 multipliers. An optimum schedule for the graph of
Figure 2 is shown in Figure 3: the 4-point FFT actually requires 7 computation
cycles. In this application 60% of the RAP's capacity is used for a rate of 12
MFlops.


The I/O bandwidth required is reduced to 25% of the bandwidth required by a
conventional bit-parallel arithmetic chip. A conventional arithmetic chip would
require 34 x 3 = 102 word transfers, where 34 corresponds to the number of
operations, and 3 corresponds to the two words of input data and one word of
output for each operation. Using a RAP, only 26 words must be transferred on
and off chip, consisting of 14 input operands, 8 output results, and 4 words of
overhead information. This reduced I/O bandwidth makes it possible for a
communications network to keep the chip busy.
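
The word counts for the 4-point FFT can be checked directly. The Python sketch
below is a worked calculation using only the numbers given in this section; the
function names are illustrative.

```python
def conventional_words(num_ops: int, words_per_op: int = 3) -> int:
    """Each operation moves two input words and one output word off chip."""
    return num_ops * words_per_op


def rap_words(inputs: int, outputs: int, overhead: int) -> int:
    """Only the formula's inputs, outputs and message overhead cross the pins."""
    return inputs + outputs + overhead


if __name__ == "__main__":
    ops = 12 + 22                      # multiplies plus additions
    conventional = conventional_words(ops)
    rap = rap_words(inputs=14, outputs=8, overhead=4)
    print(conventional, rap, f"{100 * rap / conventional:.1f}%")
```

The ratio is 26/102, about 25.5%, which matches the 25% figure quoted above.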


Figure 3: Operation Schedule for the Operations of the 4-point FFT Dataflow
Graph
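
Figure 2 is not reproduced here. One dataflow graph that is consistent with the
counts given in this section (12 multiplies, 22 additions, 14 real inputs and 8
real outputs) is a 4-point butterfly whose last three inputs are first
multiplied by complex twiddle factors. That structure is an assumption made for
illustration, not something the text states. Under that assumption, the Python
sketch below evaluates the graph using only real multiplies and real
add/subtracts and counts them.

```python
import cmath


class Counter:
    """Counts real operations; add and subtract share the adder/subtractors."""

    def __init__(self):
        self.mults = 0
        self.adds = 0

    def mul(self, a, b):
        self.mults += 1
        return a * b

    def add(self, a, b):
        self.adds += 1
        return a + b

    def sub(self, a, b):
        self.adds += 1
        return a - b


def twiddle(c, x, w):
    """Complex product x*w from 4 real multiplies and 2 real add/subtracts."""
    (xr, xi), (wr, wi) = x, w
    return (c.sub(c.mul(xr, wr), c.mul(xi, wi)),
            c.add(c.mul(xr, wi), c.mul(xi, wr)))


def fft4(c, x, w):
    """x: four (re, im) inputs; w: three (re, im) twiddles applied to x[1..3]."""
    y0 = x[0]
    y1, y2, y3 = (twiddle(c, x[k + 1], w[k]) for k in range(3))
    # First level of the 4-point DFT: 8 real add/subtracts.
    a = (c.add(y0[0], y2[0]), c.add(y0[1], y2[1]))
    b = (c.sub(y0[0], y2[0]), c.sub(y0[1], y2[1]))
    s = (c.add(y1[0], y3[0]), c.add(y1[1], y3[1]))
    d = (c.sub(y1[0], y3[0]), c.sub(y1[1], y3[1]))
    # Second level: 8 more. Multiplying d by -j or +j only swaps and negates parts.
    return [(c.add(a[0], s[0]), c.add(a[1], s[1])),   # X0 = a + s
            (c.add(b[0], d[1]), c.sub(b[1], d[0])),   # X1 = b - j*d
            (c.sub(a[0], s[0]), c.sub(a[1], s[1])),   # X2 = a - s
            (c.sub(b[0], d[1]), c.add(b[1], d[0]))]   # X3 = b + j*d


def reference_dft(y):
    """Direct 4-point DFT on complex numbers, used only as a check."""
    return [sum(y[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]


if __name__ == "__main__":
    counter = Counter()
    x = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.25), (-2.0, 1.5)]
    w = [(0.6, -0.8), (0.0, -1.0), (-0.8, -0.6)]
    fft4(counter, x, w)
    print(counter.mults, "multiplies,", counter.adds, "add/subtracts")
```

The three twiddle products account for 12 multiplies and 6 add/subtracts, and
the two butterfly levels add 16 more, giving the 12 and 22 quoted above.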


3 Messages

There are three types of messages that the RAP processes in order to support
the types of operations described above:

1. CONFIGURE AND EXECUTE (C+E). This message causes operands to be loaded into
the input registers, passed through one or more switch configurations, and
then unloaded from the output registers. This is repeated for each set of
operands.

2. STORE METHOD (SM). This message is used to store a method in local memory so
that it can be used by the C+E message. A method describes a sequence of switch
configurations necessary to perform a calculation.

3. STORE TEMPLATE (ST). This message is used to store a template in local
memory. A template contains forwarding information that allows the cascading of
several RAP chips.
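
The relationship between the three message types can be sketched as a small
data model. The Python below is a hypothetical illustration, not the RAP's
message format: the class and field names are invented for the example, the
template contents are left abstract, and the only behaviour modelled is that a
C+E message looks up a previously stored method and template and steps every
operand set through every configuration of the method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StoreMethod:                  # SM
    method_id: int
    configurations: List[int]       # one entry per switch configuration


@dataclass
class StoreTemplate:                # ST
    template_id: int
    forwarding: Dict[str, object]   # forwarding information, left abstract here


@dataclass
class ConfigureAndExecute:          # C+E
    method_id: int
    template_id: int
    operand_sets: List[List[float]]


class RapLocalMemory:
    """Holds stored methods and templates and counts the work a C+E implies."""

    def __init__(self) -> None:
        self.methods: Dict[int, List[int]] = {}
        self.templates: Dict[int, Dict[str, object]] = {}

    def receive(self, msg) -> int:
        """Return how many switch configurations the message steps through."""
        if isinstance(msg, StoreMethod):
            self.methods[msg.method_id] = msg.configurations
            return 0
        if isinstance(msg, StoreTemplate):
            self.templates[msg.template_id] = msg.forwarding
            return 0
        # C+E: both lookups raise KeyError if nothing was stored beforehand.
        configurations = self.methods[msg.method_id]
        _ = self.templates[msg.template_id]
        return len(configurations) * len(msg.operand_sets)


if __name__ == "__main__":
    rap = RapLocalMemory()
    rap.receive(StoreMethod(1, [0] * 7))              # a 7-configuration method
    rap.receive(StoreTemplate(1, {"next": "host"}))
    print(rap.receive(ConfigureAndExecute(1, 1, [[1.0], [2.0], [3.0]])))
```

Sending the C+E message before the SM and ST messages fails in this sketch,
which mirrors the requirement that methods and templates be stored first.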


Message formats are shown in Figure 4. The C+E message has METHOD-ID and
TEMPLATE-ID fields that specify the method and template to be used. Method and
template IDs are memory addresses that point at the first element of the method
or template. Methods and templates must be sent to the RAP before the C+E
commands that use them. The C+E message also has NODE-ID and REPLY-ID fields
that specify the ultimate destination of the results. The NODE-ID is the
network address of a non-RAP node, and the REPLY-ID is a message header. These
two fields are used in conjunction with template information to forward output
results. The information contained in methods and templates is discussed in
detail in the following sections.

Figure 4: Message Formats

3.1 Methods

A method consists of all the information necessary to put operands through a
sequence of switch configurations. It includes:

1. Which input registers to load with the operands.

2. Which output registers will contain the results.

3. The number of switch configurations that the operands are to go through.

4. A description of each of the configurations. A configuration contains bits
describing the switch connectivity, and bits that determine the functionality
of the AUs (e.g. a bit might determine whether an AU does an add or a
subtract).

The first three pieces of information are packed into one word (64 bits) of the
method description. Each configuration also takes one word. The number of sets
of operands that will be passed through using a given method is not included in
the method description since it can be deduced from the end of message signal.

3.2 Templates

A template is used to permit the cascading of several RAP chips. It contains
information that allows the forwarding of the output data, in the form of a C+E
message, to another RAP for further computation. A template consists of the
address of the next RAP that the results are to be sent to, and the instruction
(method and template) that is to be executed there.


Cascading of RAPs works as follows: A general purpose node such as an MDP [2]
sets up the pipeline by loading methods and templates into the appropriate RAP
chips. A C+E message is then sent to the first RAP in the pipeline to begin the
calculation. Each RAP in the pipeline uses its template to forward its results
to the next RAP for the next stage of the computation. The NODE-ID and REPLY-ID
are passed from RAP to RAP until they are used at the last stage to get the
results to their final destination. In some cases it is necessary to combine
results coming from different RAPs before continuing the computation: this
combining can be done by an MDP.

Templates are specified separately from the method to allow different
calculations to use common subroutines and to permit a single calculation to
distribute its work over several RAPs. An example of two calculations using a
common subroutine would be a routine that multiplies all the elements of two
vectors. Depending on which template is used to forward the result, such a
routine could be used by itself or could be used in a dot product routine where
the next step is to add all the products together. A RAP could also use
different templates to divide up a problem so that the next step of the
calculation is being executed on a number of different RAPs. In this case
several RAPs would contain the same method and the choice of templates would
distribute the work over these processors.
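
The dot product example can be sketched as a two-stage cascade. The Python
below is a toy model under several assumptions that are not taken from the RAP
design: methods are ordinary Python functions standing in for sequences of
switch configurations, node addresses are strings, and a template ID of None is
used to mean "this is the last stage, reply to NODE-ID". All names are
hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Template:
    next_rap: str                # address of the RAP that receives the results
    method_id: str               # method to execute there
    template_id: Optional[str]   # template to use there (None ends the cascade)


@dataclass
class CE:
    method_id: str
    template_id: Optional[str]
    node_id: str                 # final, non-RAP destination
    reply_id: str                # header for the final reply
    operands: List[float]


class Rap:
    def __init__(self) -> None:
        self.methods: Dict[str, Callable[[List[float]], List[float]]] = {}
        self.templates: Dict[str, Template] = {}

    def execute(self, msg: CE):
        results = self.methods[msg.method_id](msg.operands)
        if msg.template_id is None:
            return msg.node_id, (msg.reply_id, results)
        t = self.templates[msg.template_id]
        # NODE-ID and REPLY-ID are carried along unchanged.
        return t.next_rap, CE(t.method_id, t.template_id,
                              msg.node_id, msg.reply_id, results)


def run(raps: Dict[str, Rap], first: str, msg: CE):
    """Deliver messages until one is addressed to a node that is not a RAP."""
    dest, payload = first, msg
    while dest in raps:
        dest, payload = raps[dest].execute(payload)
    return dest, payload


def multiply(v: List[float]) -> List[float]:
    half = len(v) // 2
    return [a * b for a, b in zip(v[:half], v[half:])]


def build() -> Dict[str, Rap]:
    raps = {"rap0": Rap(), "rap1": Rap()}
    raps["rap0"].methods["multiply"] = multiply
    raps["rap0"].templates["to_sum"] = Template("rap1", "sum", None)
    raps["rap1"].methods["sum"] = lambda v: [sum(v)]
    return raps


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(run(build(), "rap0", CE("multiply", None, "mdp", "products", data)))
    print(run(build(), "rap0", CE("multiply", "to_sum", "mdp", "dot", data)))
```

The same multiply method serves both calls; only the template named in the C+E
message decides whether the products go straight back to the requesting node or
on to a second RAP that sums them.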


4  Architecture 

4.1  Block  Diagram 

Figure 5 shows a block diagram of the complete RAP consisting of the control
blocks, the memories, and the datapath. There are three control blocks,
referred to as input control, output control, and switch control, that
coordinate the execution of messages. Input control handles the reception of
messages, the input to the datapath, and memory operations. Output control is
responsible for unloading result messages into the output queue, while switch
control is responsible for loading switch configurations at the correct time.
By dividing the control into these three different blocks the operations of
loading operands, unloading results, and changing the switch configuration can
be pipelined. Hardware interlocks resolve memory contention and provide
feedback to prevent the output queue from overflowing.

There are three memories on the chip: a main memory for holding templates and
methods, an input queue, and an output queue. The input and output queues are
64 word memories with separate ports for the network and processor. The main
memory (256 words) is shared between the input control and the switch control,
with priority given to the switch control.

The datapath consists of 16 input registers, a switch, a switch configuration
register, a collection of 16 function units, 16 output registers, and some
buffer storage for the template. The 16 function units consist of 4 plus/minus
arithmetic units (AUs), 4 multiply AUs, and 8 feedthroughs used to pass
operands unchanged with the appropriate delay.

Operands are loaded into the input registers from the input queue, propagate
through the function units for one or more "computation cycles" (a computation
cycle is defined to be the time for one pass through the switch and functional
units) with the outputs of the AUs and feedthroughs feeding back into the
switch, and then are unloaded from the output registers into the output queue.
The input and output registers perform parallel-serial and serial-parallel
conversion respectively, and can be loaded or unloaded while the switch and AUs
are busy computing another problem instance. After each computation cycle the
switch is reconfigured by the switch control unit, which reloads the switch
configuration register. The appropriate template is unloaded into the output
queue before the output results.
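
The flow of operands through successive computation cycles can be illustrated
with a small functional model. The Python sketch below makes simplifying
assumptions that are not part of the RAP design: function units 0 to 3 are the
adder/subtractors, 4 to 7 the multipliers and 8 to 15 the feedthroughs; any
unit may read any of the 16 values presented to the switch (a full crossbar);
and idle units output zero. The configuration format is invented for the
example.

```python
ADD, SUB, MUL, PASS = "add", "sub", "mul", "pass"


def computation_cycle(values, config):
    """One pass through the switch and the 16 function units.

    values: the 16 words presented to the switch.
    config: {unit: (op, src_a, src_b)}; sources index into `values`.
    """
    out = [0.0] * 16
    for unit, (op, a, b) in config.items():
        if unit < 4:
            assert op in (ADD, SUB), "units 0-3 are adder/subtractors"
        elif unit < 8:
            assert op == MUL, "units 4-7 are multipliers"
        else:
            assert op == PASS, "units 8-15 are feedthroughs"
        if op == ADD:
            out[unit] = values[a] + values[b]
        elif op == SUB:
            out[unit] = values[a] - values[b]
        elif op == MUL:
            out[unit] = values[a] * values[b]
        else:
            out[unit] = values[a]      # feedthrough: pass one operand unchanged
    return out


def run_method(input_registers, configs):
    """Feed the unit outputs back into the switch once per configuration."""
    values = list(input_registers)
    for config in configs:
        values = computation_cycle(values, config)
    return values


if __name__ == "__main__":
    # Evaluate (a + b) * (c - d) + e with a..e in input registers 0..4.
    a, b, c, d, e = 1.0, 2.0, 5.0, 3.0, 10.0
    registers = [a, b, c, d, e] + [0.0] * 11
    method = [
        {0: (ADD, 0, 1), 1: (SUB, 2, 3), 8: (PASS, 4, None)},
        {4: (MUL, 0, 1), 8: (PASS, 8, None)},
        {0: (ADD, 4, 8)},
    ]
    print(run_method(registers, method)[0])
```

The operand e is not needed until the third cycle, so it is carried through a
feedthrough in the first two cycles; this is the delay role the feedthroughs
play in the text.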


4.2 Arithmetic Units

The RAP includes adder/subtractors and multipliers. We estimate that these
units will run at 80 MHz in a 2µm CMOS process. For simplicity's sake, a
non-standard floating point format was chosen, consisting of an 8 bit, two's
complement exponent field, and a 56 bit, two's complement mantissa field. This
format permits a uniform treatment of the exponent and mantissa in two's
complement form. The implementation is four-bit serial in order to make full
use of the clock period. In single bit implementations signals propagate in
times much smaller than the smallest clock period that can be reliably
distributed, and thus do not make full use of the clock period. Manipulating
two or four bits at a time also allows certain efficient serial algorithms to
be used [1] [10]. Area is approximated to be 4Mλ² for an AU and
[illegible]00Kλ² for a feedthrough.

The AUs run four times faster than the memory. We denote a memory cycle as a
major cycle and an AU cycle as a minor cycle. A word time is defined as the
time required to shift a complete operand into an AU, and corresponds to 16
minor cycles or 4 major cycles. The units have a latency of two word times, in
which the exponent and mantissa are computed and normalization is performed.
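
To make the relation between four-bit serial operation and the 16 minor cycles
of a word time concrete, the Python sketch below adds two 64 bit two's
complement integers one four-bit digit at a time. It is a fixed point
illustration only. The floating point AUs are more involved, and processing the
least significant digit first is an assumption made for the example.

```python
DIGIT_BITS = 4
WORD_BITS = 64
DIGITS_PER_WORD = WORD_BITS // DIGIT_BITS   # 16 digits, one per minor cycle


def to_digits(value: int) -> list:
    """Split a 64 bit two's complement integer into 16 digits, low digit first."""
    value &= (1 << WORD_BITS) - 1
    return [(value >> (DIGIT_BITS * i)) & 0xF for i in range(DIGITS_PER_WORD)]


def from_digits(digits: list) -> int:
    """Reassemble the digits and reinterpret the result as a signed integer."""
    value = sum(d << (DIGIT_BITS * i) for i, d in enumerate(digits))
    if value >= 1 << (WORD_BITS - 1):
        value -= 1 << WORD_BITS
    return value


def serial_add(a: int, b: int):
    """Add one digit per minor cycle, keeping the carry between cycles.

    Returns the sum (modulo 2**64, signed) and the number of minor cycles used.
    """
    carry = 0
    out = []
    for da, db in zip(to_digits(a), to_digits(b)):
        total = da + db + carry
        out.append(total & 0xF)
        carry = total >> DIGIT_BITS
    return from_digits(out), len(out)


if __name__ == "__main__":
    print(serial_add(123456789, -987654321))
```

Each step touches only four bits of each operand plus a carry, which is why the
datapath and the switch can be four bits wide while still handling 64 bit
operands in one word time.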


Figure 5: RAP Block Diagram


4.3  Switch 


5  Performance  Evaluation 


The switch is shown in Figure 6. Each AU selects one of 8 inputs for each of
its two operands, while the feedthroughs each have the choice of 2 inputs. On
the first configuration of any given method the inputs are taken from the 16
input registers, while on subsequent configurations the inputs are taken from
the outputs of the AUs and feedthroughs. The column of 2x1 multiplexers is used
to make this choice.

The switch chosen does not offer the complete connectivity of a 16x16 crossbar
but has the advantage of being much smaller and requiring less state
information. Each configuration can be described by using 60 bits (3 bits for
each AU input, 1 bit for each feedthrough, and 1 bit for each adder/subtractor
unit to select its function), which fits into a single 64 bit word. This allows
a change of the switch configuration in a single clock cycle by reading a
single word from memory. For the problems we examined, the incomplete
connectivity did not prevent us from mapping the problem efficiently onto the
switch. The main reason for this was that during any given stage of the
calculation not all the AUs are needed, so that it is easy to route outputs to
the desired inputs by choosing between several possible free AUs.
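
The 60 bit count can be verified by packing a configuration into a word. In the
Python sketch below the field order (AU operand selectors first, then
feedthrough selectors, then the add/subtract bits) is an assumption chosen for
illustration; the text gives only the field widths.

```python
NUM_AUS = 8             # 4 adder/subtractors and 4 multipliers, 2 operands each
NUM_FEEDTHROUGHS = 8
NUM_ADDSUB = 4


def pack_config(au_selects, feedthrough_selects, subtract_flags):
    """Pack one switch configuration; returns (word, number of bits used).

    au_selects: 8 pairs of 3 bit selectors, one pair per AU.
    feedthrough_selects: 8 one-bit selectors.
    subtract_flags: 4 one-bit function selects for the adder/subtractors.
    """
    assert len(au_selects) == NUM_AUS
    assert len(feedthrough_selects) == NUM_FEEDTHROUGHS
    assert len(subtract_flags) == NUM_ADDSUB
    word, pos = 0, 0
    for pair in au_selects:
        for sel in pair:
            assert 0 <= sel < 8
            word |= sel << pos
            pos += 3
    for bit in list(feedthrough_selects) + list(subtract_flags):
        assert bit in (0, 1)
        word |= bit << pos
        pos += 1
    return word, pos


def unpack_config(word):
    """Inverse of pack_config for the same assumed layout."""
    au = [((word >> (6 * i)) & 7, (word >> (6 * i + 3)) & 7)
          for i in range(NUM_AUS)]
    ft = [(word >> (48 + i)) & 1 for i in range(NUM_FEEDTHROUGHS)]
    sub = [(word >> (56 + i)) & 1 for i in range(NUM_ADDSUB)]
    return au, ft, sub


if __name__ == "__main__":
    word, bits = pack_config([(7, 7)] * 8, [1] * 8, [1] * 4)
    print(bits, word < 2 ** 64)
```

With every field at its maximum value the packed word is 2**60 - 1, so four
bits of the 64 bit configuration word remain unused under this layout.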

An expression compiler is needed to map a given equation or set of equations
into a series of appropriate switch configurations. A compiler is currently
under development based on a critical path analysis of the expression and a
greedy scheduling of operations.
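
The general idea of combining a critical path analysis with greedy scheduling
can be sketched as a list scheduler. The Python below is not the compiler
described above, whose details are not given; it is a minimal illustration that
assigns operations to computation cycles subject to a limit on the number of
adders and multipliers, prefers operations with the longest chain of dependent
work, and ignores switch connectivity and feedthrough usage. The graph format
and function names are invented for the example.

```python
def critical_path_lengths(ops):
    """Longest chain of dependent operations starting at each operation."""
    users = {name: [] for name in ops}
    for name, (_, deps) in ops.items():
        for d in deps:
            users[d].append(name)
    memo = {}

    def length(name):
        if name not in memo:
            memo[name] = 1 + max((length(u) for u in users[name]), default=0)
        return memo[name]

    return {name: length(name) for name in ops}


def greedy_schedule(ops, capacity):
    """ops: {name: (kind, [names of operations it depends on])}.

    capacity: {kind: units available per computation cycle}.
    Returns a list of cycles, each a list of operation names.
    """
    priority = critical_path_lengths(ops)
    done = set()
    cycles = []
    while len(done) < len(ops):
        ready = [n for n, (_, deps) in ops.items()
                 if n not in done and all(d in done for d in deps)]
        ready.sort(key=lambda n: (-priority[n], n))   # longest path first
        free = dict(capacity)
        chosen = []
        for n in ready:
            kind = ops[n][0]
            if free[kind] > 0:
                free[kind] -= 1
                chosen.append(n)
        done.update(chosen)
        cycles.append(chosen)
    return cycles


if __name__ == "__main__":
    # (p1 + p2) * (p3 + p4), where each p is itself a product of two inputs.
    graph = {
        "p1": ("mul", []), "p2": ("mul", []),
        "p3": ("mul", []), "p4": ("mul", []),
        "s1": ("add", ["p1", "p2"]), "s2": ("add", ["p3", "p4"]),
        "t": ("mul", ["s1", "s2"]),
    }
    print(greedy_schedule(graph, {"add": 4, "mul": 4}))
    print(greedy_schedule(graph, {"add": 1, "mul": 2}))
```

With 4 adders and 4 multipliers the example needs three cycles; with fewer
units it stretches to four. A greedy scheduler of this kind is not guaranteed
to find an optimum schedule, and on larger graphs its result can depend on how
ties between equally critical operations are broken.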


A simulator for the RAP architecture has been written and used to verify
control and to evaluate performance. This simulator allows the sending of
messages to the RAP, as well as the cascading of RAPs. Performance figures
assume a minor cycle of 12.5ns, a major cycle of 50ns, an input bandwidth of
400Mbit/sec, and an output bandwidth of 400Mbit/sec [5].
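
The headline performance figures can be related to these cycle times with a few
lines of arithmetic. The Python sketch below is a rough model and not the
simulator: it assumes that one computation cycle lasts two word times (the AU
latency given in Section 4.2), that each AU completes at most one operation per
computation cycle, and that problem instances follow one another without gaps.
The function names are illustrative.

```python
MINOR_NS = 12.5                    # minor (AU) cycle
MAJOR_NS = 4 * MINOR_NS            # major (memory) cycle: 50 ns
WORD_TIME_NS = 16 * MINOR_NS       # 200 ns to shift in one 64 bit operand
CYCLE_NS = 2 * WORD_TIME_NS        # assumed computation cycle: 400 ns
NUM_AUS = 8
WORD_BITS = 64


def peak_mflops() -> float:
    """All 8 AUs busy in every computation cycle."""
    return NUM_AUS / CYCLE_NS * 1e3


def method_mflops(num_ops: int, num_configs: int) -> float:
    """Average rate for a method with the given operation and cycle counts."""
    return num_ops / (num_configs * CYCLE_NS) * 1e3


def io_mbit_per_sec(words: int, num_configs: int) -> float:
    """Combined input and output rate needed to keep the method running."""
    return words * WORD_BITS / (num_configs * CYCLE_NS) * 1e3


if __name__ == "__main__":
    print(peak_mflops())                       # peak rate in MFlops
    print(round(method_mflops(34, 7), 1))      # 4-point FFT of Section 2
    print(round(io_mbit_per_sec(26, 7)))       # its 26 words of I/O
```

Under these assumptions the model reproduces the 20MFlop peak and gives about
12 MFlops for the 4-point FFT of Section 2, with a combined I/O requirement of
roughly 594 Mbit/sec for its 26 words.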

A number of formulas have been mapped onto the RAP and some of these results
are shown in Table 1. For each problem in this table we list the total number
of floating point operations performed, the number of input operands and output
operands, and the number of switch configurations in the method used to define
the calculation. From these figures we calculate the average floating point
rate achieved and the I/O bandwidth required to keep a RAP busy with the
problem. The latency column refers to the time from when one problem instance
is in the input buffer to when the complete result is in the output buffer and
includes all control overhead.

Average floating point performance achieved depends on how well a problem is
able to use the parallelism made available by the RAP. The average floating
point performance for these problems was 9 MFlops or 45% of the peak
performance possible. Optimization is possible in some of these cases by
combining several problems in order to use more of the resources; for instance
doing two 2x2 FFTs at the same time uses the RAP more efficiently than a single
2x2 FFT. The I/O bandwidth required depends on the locality inherent in the
problem. For instance the vector sum example has no locality that can be
exploited by the RAP and there is no bandwidth advantage in using it (in fact
if overhead is included, using the RAP is more costly). The bandwidth required
for the other problems, however, has in every case been reduced to within the
capacity of our network. The I/O bandwidth required will increase slightly
because of communication and message handling overhead. The percentage cost of
this overhead depends on how many sets of operands are sent in a single
message.

Table 1: RAP performance in typical applications


Figure 6: Switch Configuration

Figure 7: RAP Prototype

The key feature of the RAP is that it reduces the data transfer bandwidth that
the network must sustain to do arithmetic efficiently. Matching the network
speed to the RAP speed is an important consideration. To achieve balance, each
method should have sufficient configurations to keep the RAP busy without
overloading the network, and few enough for the RAP to keep up with the
network. Table 2 shows how many configurations in a method would be ideal,
given the number of input operands. Note that these figures assume an unloaded
network: if there is other traffic on the network the I/O bandwidth will
decrease, and the time for a complete set of operands to arrive and the ideal
number of configurations per method will both increase. Any mismatch in speed
between the network and the RAP can be somewhat compensated for by the input
and output buffers, but in the worst case may back up the network.
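
The balance condition can be illustrated with a back-of-the-envelope model.
The Python sketch below is not a reconstruction of Table 2. It assumes an input
link of 400Mbit/sec, a computation cycle of 400ns (two word times), and that
the ideal method keeps the RAP busy for exactly the time the next set of input
words takes to arrive. The function name and default values are illustrative.

```python
def ideal_configurations(input_words: int,
                         link_mbit_per_sec: float = 400.0,
                         cycle_ns: float = 400.0,
                         word_bits: int = 64) -> float:
    """Configurations per method that match compute time to operand arrival."""
    arrival_ns = input_words * word_bits / link_mbit_per_sec * 1e3
    return arrival_ns / cycle_ns


if __name__ == "__main__":
    # 14 operands plus 4 words of overhead, as in the FFT example of Section 2.
    print(ideal_configurations(18))
    # Other traffic that halves the available bandwidth doubles the ideal count.
    print(ideal_configurations(18, link_mbit_per_sec=200.0))
```

For the 18 input words of the FFT example this model gives about 7
configurations, close to the 7 computation cycles of the schedule in Figure 3,
and it shows the trend described above: less available bandwidth raises the
ideal number of configurations per method.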

6  Prototype  Hardware 

A RAP test chip (Figure 7) has been fabricated and tested in MOSIS 2µm Scalable
CMOS technology by MIT students Stuart Fiske, Josef Shaoni, and Petr Spacek in
order to investigate some of the ideas described above, in particular the idea
of having a reconfigurable datapath. It consists of 12, 16-bit, two-bit serial,
fixed point arithmetic units connected by statically reconfigurable sparse
crossbar switches. The datapath is a three stage pipeline; each stage has 4 AUs
and is connected to the next stage by a switch. Each AU takes three operands
and is capable of doing multiplication, addition, or subtraction of two of its
operands while passing the third unchanged, or of multiplying two of its
operands and adding/subtracting the third. Two register files store input and
output operands and perform parallel-serial and serial-parallel conversion.

Although the switch setup is different than that of the floating point RAP,
this chip demonstrates that the switch can be efficiently implemented: about
12% of the total chip area is devoted to the switch and switch control, and
this percentage will be much smaller in the case of the 64 bit floating point
operations because the AUs and registers will be much bigger.

7 Conclusion


The Reconfigurable Arithmetic Processor is a special purpose processor
specifically designed to fit into a message passing concurrent computer system.
The RAP attempts to capitalize on the simplicity of serial arithmetic and the
performance benefits of functional parallelism to create a powerful floating
point arithmetic single chip processor. By changing the way in which its serial
arithmetic units are interconnected, the RAP can provide additional flexibility
by allowing complete arithmetic formulae to be calculated all at once without
intermediate results going off chip or to local memory, substantially reducing
the data transfer required.

The concepts used in the RAP provide the means to efficiently use message
passing to achieve high performance numerical computing. A 1024 node message
passing concurrent computer system with 128 RAPs would provide over 2GFlops of
peak computing power.

Much work remains to be accomplished at the implementation level. The design of
the serial floating point units in particular is critical and raises several
issues related to the numerical aspects of floating point, including handling
overflow, underflow, rounding, and denormalized numbers. There is also the
potential of pipelining two problems through each floating point unit in order
to increase performance.

Areas for further research include investigating ways of implementing division
in the same framework, as well as investigating how to take advantage of local
memory in a RAP to further reduce the bandwidth constraints. In particular,
allowing constants to be specified as part of a method would reduce I/O
bandwidth in many cases. Giving the RAP more control, so that it is less
dependent on off chip control, may also lead to reduced bandwidth requirements
and increased flexibility.


References

[1] Dally W.J., "A High Performance VLSI Quaternary Serial Multiplier", Proc.
ICCD-87, pp. 649-653.

[2] Dally W.J. et al., "Architecture of a Message-Driven Processor",
Proceedings of the 14th ACM/IEEE Symposium on Computer Architecture, June 1987,
pp. 189-196.

[3] Dally W.J. et al., "Concurrent Computer Architecture", Proc. of Symp. on
Parallel Computations and Their Impact on Mechanics, 1987.

[4] Dally W.J., Seitz C.L., "The Torus Routing Chip", J. Distributed Systems,
Vol. 1, No. 3, 1986, pp. 187-196.

[5] Dally W.J., Song P., "Design of a Self-Timed VLSI Multicomputer
Communication Controller", Proc. ICCD-87, pp. 230-234.

[6] Denyer P., Renshaw D., VLSI Signal Processing: A Bit-Serial Approach,
Addison-Wesley Publishing Company, 1985.

[7] Gosling J.B., Zurawski J.H.P., Edwards D.B.J., "A Chip Set for High-Speed
Low Cost Floating Point Unit", Proc. 5th Symposium on Computer Arithmetic, IEEE
Computer Society Press, 1981, pp. 50-55.

[8] Lyon R.F., "A Bit-Serial VLSI Architectural Methodology for Signal
Processing", VLSI 81, ed. J.P. Gray, Academic Press, 1981, pp. 131-140.

[9] Lyon R.F., "MSSP: A Bit-Serial Multiprocessor for Signal Processing", VLSI
Signal Processing: A Bit-Serial Approach, Denyer, Renshaw, Chapter 13,
Addison-Wesley Publishing Company, 1985.

[10] Lyon R.F., "Two's Complement Pipeline Multipliers", IEEE Trans. Comm.,
Vol. COM-24, April 1976, pp. 418-425.

[11] McAllister W.H., Carlson J.R., "Floating-Point Chip Set Speeds Real-Time
Computer Operation", Hewlett-Packard Journal, February 1984, pp. 17-23.

[12] Oppenheim A.V., Schafer R.W., Digital Signal Processing, Prentice-Hall
Inc., Englewood Cliffs, New Jersey, 1975.

[13] Owens R.M., "Compound Algorithms for Digit On-line Arithmetic", 5th Symp.
Comput. Arith., Ann Arbor, MI, May 1981, pp. 64-71.

[14] Rabiner L.R., Gold B., Theory and Applications of Digital Signal
Processing, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1975.

[15] Reach K., "Math Chips: how they work", IEEE Spectrum, July 1987, pp.
25-30.

[16] Trivedi K.S., Ercegovac M.D., "On-line Algorithms for Division and
Multiplication", IEEE Trans. Comput., vol. C-26, no. 7, pp. 681-687, July 1977.

[17] Weitek Corporation, "WTL1064/1065 High Speed 64-bit IEEE Floating Point
Multiplier/ALU", Preliminary Data, 1984.


